Lexical Normalisation of Short Text Messages: Makn Sens a #twitter
نویسندگان
چکیده
Twitter provides access to large volumes of data in real time, but is notoriously noisy, hampering its utility for NLP. In this paper, we target out-of-vocabulary words in short text messages and propose a method for identifying and normalising ill-formed words. Our method uses a classifier to detect ill-formed words, and generates correction candidates based on morphophonemic similarity. Both word similarity and context are then exploited to select the most probable correction candidate for the word. The proposed method doesn’t require any annotations, and achieves state-of-the-art performance over an SMS corpus and a novel dataset based on Twitter.
منابع مشابه
Han, Bo, Paul Cook and Timothy Baldwin (2013) Lexical Normalisation of Short Text Messages, ACM Transactions on Intelligent Systems and Technology 4(1), pp. 5:1¡1⁄25:27
Twitter provides access to large volumes of data in real time, but is notoriously noisy, hampering its utility for NLP. In this paper, we target out-of-vocabulary words in short text messages and propose a method for identifying and normalising lexical variants. Our method uses a classifier to detect lexical variants, and generates correction candidates based on morphophonemic similarity. Both ...
متن کاملAutomatically Constructing a Normalisation Dictionary for Microblogs
Microblog normalisation methods often utilise complex models and struggle to differentiate between correctly-spelled unknown words and lexical variants of known words. In this paper, we propose a method for constructing a dictionary of lexical variants of known words that facilitates lexical normalisation via simple string substitution (e.g. tomorrow for tmrw). We use context information to gen...
متن کاملImproving the utility of social media with Natural Language Processing
Social media has been an attractive target for many natural language processing (NLP) tasks and applications in recent years. However, the unprecedented volume of data and the non-standard language register cause problems for off-the-shelf NLP tools. This thesis investigates the broad question of how NLP-based text processing can improve the utility (i.e., the effectiveness and efficiency) of s...
متن کاملMelbourne Language Group Microblog Track Report
This report outlines the TREC 2011 microblog track submission of the Language Technology Group at The University of Melbourne. The microblog track is an ad–hoc retrieval task over Twitter data with temporally-specified queries, and the requirement that all results must predate the query. Our objective is to establish baseline results for the task and study the relative impact of various factors...
متن کاملThematic Representation of Short Text Messages with Latent Topics: Application in the Twitter context
The amount of information exchanged over the Internet is continuously growing, taking the form of short text messages on microblogging platforms such as Twitter. Due to the limited size of these types of messages, their understanding may require to know the context of their occurrence. In this paper, we propose a higher-level representation of short text messages based on a thematic model obtai...
متن کامل